Coresets for k-Segmentation of Streaming Data

Authors

Guy Rosman

Mikhail Volkov

Dan Feldman

John W. Fisher

Daniela Rus

Abstract

Life-logging video streams, financial time series, and Twitter tweets are a few examples of high-dimensional signals over practically unbounded time. We consider the problem of computing an optimal segmentation of such signals by a k-piecewise linear function, using only one pass over the data by maintaining a coreset for the signal. The coreset enables fast further analysis, such as automatic summarization, of such signals. A coreset (core-set) is a compact representation of the data seen so far, which approximates the data well for a specific task; in our case, segmentation of the stream. We show that, perhaps surprisingly, the segmentation problem admits coresets of cardinality only linear in the number of segments k, independently of both the dimension d of the signal and its number n of points. More precisely, we construct a representation of size O(k log n/ε) that provides a (1+ε)-approximation for the sum of squared distances to any given k-piecewise linear function. Moreover, such coresets can be constructed in a parallel streaming approach. Our results rely on a novel reduction of statistical estimations to problems in computational geometry. We empirically evaluate our algorithms on very large synthetic and real data sets from GPS, video and financial domains, using 255 machines in the Amazon cloud.
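To make the idea of a compact, task-specific summary concrete, here is a minimal sketch for the simplest case of a single segment. It is not the authors' construction. It assumes the "distance" of a point to a linear function is the residual at that point's time stamp, and the names `summarize`, `cost_from_summary`, and `cost_direct` are hypothetical. Under those assumptions, a (d+2)×(d+2) Gram matrix reproduces the sum of squared distances to any linear function exactly, and summaries of disjoint chunks merge by addition, which is what makes one-pass and parallel processing possible.

```python
import numpy as np

def summarize(t, P):
    """Gram-matrix summary of one segment (illustrative, not the paper's coreset).

    t: (n,) time stamps, P: (n, d) signal values.
    Returns a (d+2) x (d+2) matrix whose size does not depend on n.
    """
    X = np.column_stack([np.ones_like(t, dtype=float), t, P])
    return X.T @ X

def cost_from_summary(M, A):
    """Sum of squared residuals to p(t) = A[0] + A[1] * t, using only M.

    A: (2, d) array holding the intercept row and the slope row.
    """
    d = A.shape[1]
    # X @ C equals P - [1, t] @ A, so the cost is trace(C^T M C).
    C = np.vstack([-A, np.eye(d)])
    return float(np.trace(C.T @ M @ C))

def cost_direct(t, P, A):
    """The same cost computed from the raw points, for comparison."""
    R = P - np.column_stack([np.ones_like(t, dtype=float), t]) @ A
    return float(np.sum(R ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(1000, dtype=float)
    P = rng.normal(size=(1000, 4))
    A = rng.normal(size=(2, 4))
    # Summaries of two chunks of the stream are merged by simple addition.
    M = summarize(t[:600], P[:600]) + summarize(t[600:], P[600:])
    print(cost_from_summary(M, A), cost_direct(t, P, A))
```

The exactness here is specific to one segment and a fixed time range. Once the query may place k segment boundaries anywhere in the stream, a single summary of this kind no longer suffices, which is where the (1+ε)-approximation guarantee described in the abstract comes in.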


Related resources

Coresets for k-Segmentation of Streaming Data Supplementary Material

In this supplementary material we detail the construction, properties, and proofs for a k-segment mean coreset that allows efficient segmentation of high-dimensional signals. We define the k-segment mean problem in Section 2. We describe a coreset for the 1-segment mean in Section 3. We show why a similar construction is not possible for the k-segment mean problem in Section 4. In Sections 5, 6, 7...


Core-Preserving Algorithms

We define a class of algorithms for constructing coresets of (geometric) data sets, and show that algorithms in this class can be dynamized efficiently in the insertion-only (data stream) model. As a result, we show that for a set of points in fixed dimensions, additive and multiplicative ε-coresets for the k-center problem can be maintained in O(1) and O(k) time respectively, using a data struc...


Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...


On k-Median clustering in high dimensions

We study approximation algorithms for k-median clustering. We obtain small coresets for k-median clustering in metric spaces as well as in Euclidean spaces. Specifically, in R^d, those coresets are of size with only polynomial dependency on d. This leads to a (1 + ε)-approximation algorithm for k-median clustering in R^d, with running time O(ndk + 2 O(1) dn), for any σ > 0. This is an improvement ...


Streaming Algorithms for k-Means Clustering with Fast Queries

We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. When compared with the current state-of-the-art, our methods provide a substantial improvement in the time to answer a query for cluster centers, while retaining the desirable properties of provably small approximation error, and low space usage. Our algorithms are based on a no...




Publication date: 2014
